Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Path and glob instead of walk_files #1069

Merged
merged 17 commits into from
Dec 15, 2020

Conversation

krishnakalyan3
Copy link
Contributor

@krishnakalyan3 krishnakalyan3 commented Dec 5, 2020

Hello @vincentqb,

I wanted to take an initial stab at this. I have made a minor modification for

  • yesno
  • librispeech
  • libritts
  • speechcommands

Looks like tedlium, ljspeech and vctk (VCTK_092) don't have walk_files

cc #1051

@krishnakalyan3
Copy link
Contributor Author

krishnakalyan3 commented Dec 5, 2020

To test

import torchaudio
yesno_data = torchaudio.datasets.YESNO('./', download=True)
yesno_data._walker

ls_data = torchaudio.datasets.LIBRISPEECH(".", download=True)
ls_data._walker

sc_data =  torchaudio.datasets.SPEECHCOMMANDS(".", download=True)
sc_data._walker

@krishnakalyan3
Copy link
Contributor Author

yesno_data._walker

Before:
['0_0_0_0_1_1_1_1',
'0_0_0_1_0_0_0_1',
'0_0_0_1_0_1_1_0',
'0_0_1_0_0_0_1_0',
'0_0_1_0_0_1_1_0', ...]

After:
['0_0_0_0_1_1_1_1',
 '0_0_0_1_0_0_0_1',
 '0_0_0_1_0_1_1_0',
 '0_0_1_0_0_0_1_0',
 '0_0_1_0_0_1_1_0', ...]

walker = walk_files(
self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
)
walker = sorted([str(p.stem) for p in Path(self._path).glob('*/*/*'+self._ext_audio)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing whitespace around arithmetic operator :)

walker = walk_files(
self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
)
walker = sorted([str(p.stem) for p in Path(self._path).glob('*.wav')])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: + self._ext_audio instead of .wav

@vincentqb
Copy link
Contributor

Looks like tedlium does not contain walk_files/filter and I am assuming that ljspeech should also be added to the list?.

ljspeech also doesn't use walk_files, see here.

@vincentqb
Copy link
Contributor

Thanks for looking into this! gave minor comments, but LGTM so far :)

@krishnakalyan3
Copy link
Contributor Author

@vincentqb looks like vctk does not have walk_files.

Copy link
Collaborator

@mthrok mthrok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Could you also remove the definition of walk_files?

walker = walk_files(
self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
)
walker = sorted([str(p.stem) for p in Path(self._path).glob('*/*/*' + self._ext_audio)])
self._walker = list(walker)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you get rid of list(walker) as it is redundant?

I would do

self._walker =  [str(p.stem) for p in Path(self._path).glob(f'*/*/*{self._ext_audio}')]
self._walker.sort()

Copy link
Contributor

@vincentqb vincentqb Dec 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, there's no need to make a copy of the list. It seems like using iterator+sorted is similar? Is this a fair comparison?

In [1]: %timeit sorted(range(1000, 0, -1))                                                                                                                     
14.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [2]: %timeit sorted(list(range(1000, 0, -1)))                                                                                                               
16.4 µs ± 370 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: def func(): 
   ...:     a = list(range(1000, 0, -1)) 
   ...:     a.sort() 
   ...:                                                                                                                                                        

In [4]: %timeit func()                                                                                                                                         
15.4 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Copy link
Contributor

@vincentqb vincentqb Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR with sorted(generator), since it turns out that sorted(generator) creates the list and then updates it in place anyway. Once the tests pass, I'll merge this PR as it is. Thanks @krishnakalyan3 for the quick follow-up :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List comprehension is not a generator, please revert it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, okay, you changed the list comprehension to a generator.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next time, please consult it with me before doing so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find a reference that says sorted is in-place. can you give the reference for that?

Copy link
Contributor

@vincentqb vincentqb Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See implementation of sorted. Regardless, the test above points to that implementation to be faster.

@vincentqb
Copy link
Contributor

vincentqb commented Dec 15, 2020

Looks like tedlium, ljspeech and vctk (VCTK_092) don't have walk_files

Fixes: #1051

VCTK does have walk_files here. For simplicity, let's update VCTK and remove walk_files as a separate PR. Thanks again @krishnakalyan3 for the pull request :)

@codecov
Copy link

codecov bot commented Dec 15, 2020

Codecov Report

Merging #1069 (6eddeb2) into master (3691b8e) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@      Coverage Diff       @@
##   master   #1069   +/-   ##
==============================
==============================

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3691b8e...ce06026. Read the comment docs.

Copy link
Contributor

@vincentqb vincentqb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the work! We can do VCTK and removing walk_files as follow-ups. I'll go ahead and merge this once the tests are passed.

@vincentqb
Copy link
Contributor

LGTM! Merging. Thanks again @krishnakalyan3!

@vincentqb vincentqb merged commit d25a4dd into pytorch:master Dec 15, 2020
@krishnakalyan3
Copy link
Contributor Author

krishnakalyan3 commented Dec 15, 2020

Thanks for all the awesome feedback gentlemen.

The VCTK corpus which uses walk_files is deprecated in favor of VCTK_092. At this moment even if I use VCTK to forces me to use VCTK_092 dataset. Hence, I was not sure what to do.

@mthrok
Copy link
Collaborator

mthrok commented Dec 15, 2020

Thanks for all the awesome feedback gentlemen.

The VCTK corpus which uses walk_files is deprecated in favor of VCTK_092. At this moment even if I use VCTK to forces me to use VCTK_092 dataset. Hence, I was not sure what to do.

@krishnakalyan3

It was not explained in the original issue, but the fundamental reason of these changes you contributed is removing walk_files. The use of walk_files makes it ambiguous who is responsible to locate the files. (Dataset class? or utility?), and in fact just glob-ing everything is not the right problem being solved in implementing Dataset, because if you have a specific dataset you consider to access, then the directory structure and file locations are determined. No need to do possibly-infinite recursion. Each Dataset implementation should be glob-ing the right set of files it requires.

Unfortunately, the original VCTK dataset is no longer publicly available and I do not have a copy so I do not know the expected structure of directory, but we can still move the glob logic to VCTK dataset so that VCTK dataset is responsible in locating the files. Would you like to work on it? If you do not have a time, I will file a separate issue for that.

@krishnakalyan3
Copy link
Contributor Author

@mthrok thanks for the explanation. I will work on the PR.

mpc001 pushed a commit to mpc001/audio that referenced this pull request Aug 4, 2023
* Adds files for minGPT training with DDP

* filtered-clone, update script path, update readme

* add refs to karpathy's repo

* add training data

* add AMP training

* delete raw data file, update index.rst

* Update gpt2_train_cfg.yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants